Data Wrangling in R

Xiaorui (Jeremy) Zhu

02/10/2024

Importance of Data Wrangling!

“Unless the data is something I’ve analyzed a lot before, I usually feel like the blind men and the elephant.” – Jeff Leek

Most Time-Consuming Data Science Task: Data Cleaning (80% of the work)

Data preparation accounts for about 80% of the work of data scientists

Data Cleaning: ensure \(\color{red}{\text{reliability}}\) and \(\color{red}{\text{quality}}\)

In business analytics, data comes from various sources such as databases, spreadsheets, and external sources. These sources often contain missing values, duplicate entries, outliers, and formatting inconsistencies, which can strongly skew the analysis results and lead to incorrect business decisions if not addressed properly.

Data cleaning is a crucial step in the process of preparing data for analysis in the field of business analytics. It involves identifying and correcting errors, inconsistencies, and inaccuracies in the dataset to ensure its reliability and quality.

Data cleaning is a crucial step

It is important and time-consuming,

but, do you enjoy data cleaning?

Example 1

The following two datasets are the raw and clean data. The Raw dataset contains the headers and other information that triggers missing values. Without cleaning, it cannot be directly used. On the contrary, the clean dataset presented right after has the rows and colomns clearly stroed.

# install.packages("readxl")
Retail <- readxl::read_xlsx(path = "data/RetailSales2018.xlsx")
head(Retail)
## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA>  <NA>  <NA>  <NA> 
## 2 CPI Table 24(http://www.bls.gov/cpi)                   <NA>  <NA>  <NA>  <NA> 
## 3 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 4 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 5 Year                                                   Month Sales S_Fa… CPI  
## 6 1992                                                   JAN   1471… 0.89… 138.1
## # ℹ abbreviated name:
## #   ¹​`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
Retail_clean <- readxl::read_xlsx(path = "data/RetailSales2018_clean.xlsx")
Retail_clean
## # A tibble: 318 × 5
##     Year Month   Sales S_Factor   CPI
##    <dbl> <chr>   <dbl>    <dbl> <dbl>
##  1  1992 JAN   147182.    0.897  138.
##  2  1992 FEB   147013.    0.895  139.
##  3  1992 MAR   159653.    0.975  139.
##  4  1992 APR   163606.    0.993  140.
##  5  1992 MAY   170089.    1.03   140.
##  6  1992 JUN   168922.    1.02   140.
##  7  1992 JUL   169982.    1.02   140.
##  8  1992 AUG   170650.    1.02   141.
##  9  1992 SEP   164664.    0.972  141.
## 10  1992 OCT   173583.    1.02   142.
## # ℹ 308 more rows

Data sources: Retail Sales Raw Data and Retail Sales Clean Data

Data Wrangling in R

In R programming language, data wrangling skills are essential for efficiently manipulating and preparing data for analysis. Some important data wrangling skills in R include:

  1. Data Importing
  2. Data Cleaning
  3. Data Transformation
  4. Data Manipulation
  5. Data Visualization
  6. Handling Dates and Times
  7. Handling Strings
  8. Looping and Conditional Statements
  9. Error Handling
  10. Efficient Coding Practices

By performing thorough data cleaning, business analysts can improve the quality and reliability of their analyses, leading to more accurate insights and better-informed business decisions. It lays the foundation for successful data analysis and ensures that businesses can trust the results derived from their data-driven initiatives.

Data Importing

R supports the importing from various sources such as CSV files, Excel spreadsheets, databases (e.g., MySQL, PostgreSQL), web APIs, and other formats.

Option Function Speed in seconds
Base R read.csv("data/flights.csv") ~ 4 seconds
library(readr) read_csv("data/flights.csv") ~ 0.8 seconds
library(data.table) fread("data/flights.csv") ~ 0.3 seconds
library(readxl) read_excel()
Database: library(DBI) dbSendQuery()
Web APIs: library(httr) GET(url = NULL,...)

What about other flat files?

Examples: read.csv() and read_excel()

We focus on read.csv() function to read in .csv files, and read_excel() to read in .xlsx files.

  1. Example 1: Importing data from a CSV file (Auto.csv)
Auto <- read.csv("data/Auto.csv")
# View(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
  1. Example 2: Importing data from an Excel spreadsheet (RetailSales2018_clean.xlsx)
library(readxl)
Retail <- read_xlsx(path = "data/RetailSales2018.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
head(Retail)
## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA>  <NA>  <NA>  <NA> 
## 2 CPI Table 24(http://www.bls.gov/cpi)                   <NA>  <NA>  <NA>  <NA> 
## 3 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 4 <NA>                                                   <NA>  <NA>  <NA>  <NA> 
## 5 Year                                                   Month Sales S_Fa… CPI  
## 6 1992                                                   JAN   1471… 0.89… 138.1
## # ℹ abbreviated name:
## #   ¹​`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`

Data Cleaning

Skills to clean and preprocess data, which may involve handling missing values (NA), removing duplicates, converting data types, dealing with outliers, and detecting and correcting errors in the dataset using functions like na.omit(), complete.cases(), duplicated(), and various functions from the dplyr and tidyr packages.

Example 1: Removing useless rows

Retail2 <- Retail[-c(1:4),]
head(Retail2)
## # A tibble: 6 × 5
##   RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2  ...3  ...4  ...5 
##   <chr>                                                  <chr> <chr> <chr> <chr>
## 1 Year                                                   Month Sales S_Fa… CPI  
## 2 1992                                                   JAN   1471… 0.89… 138.1
## 3 1992                                                   FEB   1470… 0.89… 138.6
## 4 1992                                                   MAR   1596… 0.97… 139.…
## 5 1992                                                   APR   1636… 0.99… 139.5
## 6 1992                                                   MAY   1700… 1.02… 139.…
## # ℹ abbreviated name:
## #   ¹​`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
# Change colomn names
names(Retail2) <- as.character(Retail2[1, ])
Retail2 <- Retail2[-1, ]
Retail2[1:3,]
## # A tibble: 3 × 5
##   Year  Month Sales              S_Factor            CPI               
##   <chr> <chr> <chr>              <chr>               <chr>             
## 1 1992  JAN   147182.451         0.89700000000000002 138.1             
## 2 1992  FEB   147012.70000000001 0.89500000000000002 138.6             
## 3 1992  MAR   159653.32499999998 0.97499999999999998 139.30000000000001

Example 2: Handling missing values

clean_Retail <- na.omit(Retail2)
dim(clean_Retail)
## [1] 318   5

Example 3: Removing duplicates:

clean_Retail2 <- clean_Retail[!duplicated(clean_Retail), ]
dim(clean_Retail2)
## [1] 318   5

Data Transformation

Ability to transform data by creating new variables, reshaping data from wide to long format (and vice versa), and summarizing data using functions like mutate(), select(), filter(), arrange(), group_by(), summarize() from the dplyr package, and functions like pivot_longer() and pivot_wider() from the tidyr package.

Data Manipulation

Skills to perform various data manipulation tasks such as merging/joining datasets, splitting data into subsets, and reshaping data using functions like merge(), join(), split(), reshape(), rbind(), cbind().

Data Visualization

For quick data exploration, base R plotting functions can provide an expeditious and straightforward approach to understanding your data. Data visualization skills are the abilities to visualize data to explore patterns, relationships, and trends using functions like ggplot2 for creating sophisticated and customizable plots, plot() for basic plots, and ggplotly() from the plotly package for interactive plots.

Handling Dates and Times

Skills to work with date and time data, including parsing, formatting, extracting components (e.g., year, month, day), and performing date arithmetic using functions like as.Date(), as.POSIXct(), strftime(), strptime(), lubridate package.

Handling Strings

Ability to manipulate and process character strings, including pattern matching, substring extraction, and string manipulation using functions like grep(), sub(), gsub(), strsplit(), and stringr package.

Looping and Conditional Statements

Understanding of control structures like loops (for, while) and conditional statements (if, else, ifelse) to automate repetitive tasks and apply conditional operations on data.

Error Handling

Skills to handle errors and exceptions that may occur during data wrangling operations, including debugging techniques and using functions like tryCatch().

Efficient Coding Practices

Familiarity with efficient coding practices such as vectorization, using R’s apply family of functions (apply(), lapply(), sapply(), vapply(), mapply()), and leveraging the capabilities of packages like dplyr and tidyr for faster and more concise code.

These skills are crucial for effectively managing and preparing data for analysis and visualization tasks in R. Continuous practice and exploration of R packages and functions related to data wrangling will enhance proficiency in handling diverse datasets and extracting meaningful insights.

Time to flex our new knowledge muscles!

Let’s apply what you’ve learned to a real case.

go to top